[CELEBORN-2257] Fixed remote disks not being reported on registration #1

Open
Dzeri96 wants to merge 2 commits into Qbeast-io:main from Dzeri96:CELEBORN-2257

Dzeri96 commented Feb 5, 2026

What changes were proposed in this pull request?

  1. Disks reported to the master on registration now include remote disks (HDFS, S3, OSS).
  2. Refactored method names to clarify the difference between local and remote disks.
  3. Embedded disk type information into the enum.
  4. Refactored unnecessarily complicated code in the slot assignment and worker registration path.
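
As a rough illustration of item 3, here is a minimal sketch of an enum that carries its own local/remote flag. `DiskKind`, `isRemote`, and `kindsWhere` are hypothetical names chosen for this example and are not taken from the Celeborn codebase; the constant names simply mirror the storage types mentioned above.

```java
import java.util.EnumSet;
import java.util.Set;

public class DiskKindSketch {
    // Hypothetical enum (not Celeborn's actual type): each constant records
    // whether it is remote, so callers do not need a separate lookup table.
    enum DiskKind {
        HDD(false), SSD(false), HDFS(true), S3(true), OSS(true);

        private final boolean remote;

        DiskKind(boolean remote) {
            this.remote = remote;
        }

        boolean isRemote() {
            return remote;
        }
    }

    // Collects every kind whose remote flag matches the argument.
    static Set<DiskKind> kindsWhere(boolean remote) {
        Set<DiskKind> out = EnumSet.noneOf(DiskKind.class);
        for (DiskKind kind : DiskKind.values()) {
            if (kind.isRemote() == remote) {
                out.add(kind);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("local:  " + kindsWhere(false));
        System.out.println("remote: " + kindsWhere(true));
    }
}
```

Keeping the flag on the constant means a method that should only see local disks can filter on `isRemote()` instead of comparing against a hard-coded list of types.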

Why are the changes needed?

  1. Without this fix, the master cannot assign slots on a worker's remote disks until it receives that worker's first heartbeat.
  2. All other changes are in preparation for better support of remote disks.

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

No

How was this patch tested?

Ran the existing test suite. The code has not been deployed yet.

Collaborator

@eolivelli eolivelli left a comment

I would add a unit test that reproduces the problem and fails without this patch.
From the code alone, it is not clear to me that the change actually solves the issue.

I am not even sure the Worker needs to advertise the virtual dummy disks to the Master. Is this useful?

}
}

// TODO: Move all accesses to Type.getMask()
Collaborator

Please do not add new TODOs, or at least create an issue and link to it.

In this case I think it is better not to add the TODO.

for (newDisk <- newDiskInfos.values().asScala) {
  val mountPoint: String = newDisk.mountPoint
  val curDisk = diskInfos.get(mountPoint)
  // TODO: Avoid function side-effect
Collaborator

Same comment as above: TODOs are not good practice in a community-maintained project.

@Dzeri96
Author

Dzeri96 commented Feb 5, 2026

@eolivelli This is important because, in SlotsAllocator, the master allocates slots based on the disks each worker has reported. If a worker doesn't report a remote disk at registration, the master can't use remote storage on that worker before the first heartbeat. This is annoying at startup, but could be a big problem when auto-scaling.
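
To make the window between registration and the first heartbeat concrete, here is a minimal sketch under simplifying assumptions. `Master`, `Disk`, `report`, and `canAssignRemoteSlot` are hypothetical stand-ins written for this example; they are not the actual SlotsAllocator or registration API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RegistrationSketch {
    // Hypothetical view of a disk as reported by a worker.
    static final class Disk {
        final String mountPoint;
        final boolean remote;

        Disk(String mountPoint, boolean remote) {
            this.mountPoint = mountPoint;
            this.remote = remote;
        }
    }

    // Hypothetical master: it only knows the disks that workers have reported.
    static final class Master {
        private final Map<String, List<Disk>> disksByWorker = new HashMap<>();

        // Registration and heartbeat both replace the master's view of a worker.
        void report(String workerId, List<Disk> disks) {
            disksByWorker.put(workerId, new ArrayList<>(disks));
        }

        // A remote slot can only be offered if a remote disk is known for the worker.
        boolean canAssignRemoteSlot(String workerId) {
            for (Disk d : disksByWorker.getOrDefault(workerId, Collections.<Disk>emptyList())) {
                if (d.remote) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Disk local = new Disk("/mnt/disk1", false);
        Disk hdfs = new Disk("HDFS", true);

        Master master = new Master();
        // Registration that omits remote disks: remote slots are unavailable.
        master.report("worker-1", Arrays.asList(local));
        System.out.println("after registration: " + master.canAssignRemoteSlot("worker-1"));

        // A later heartbeat that includes the remote disk closes the gap.
        master.report("worker-1", Arrays.asList(local, hdfs));
        System.out.println("after heartbeat:    " + master.canAssignRemoteSlot("worker-1"));
    }
}
```

In this simplified model, a freshly scaled-out worker contributes no remote capacity until its first heartbeat arrives, which is the gap described above.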

Regarding tests, I wanted to submit the PR and ask for their opinion on what kind of test to write. My code fixes a problem that's present when a Worker and a Master communicate, so maybe an integration test makes more sense.

@eolivelli
Collaborator

> Regarding tests, I wanted to submit the PR and ask for their opinion on what kind of test to write. My code fixes a problem that's present when a Worker and a Master communicate, so maybe an integration test makes more sense.

The best test is one that reproduces the problem, so that you can demonstrate that your code actually fixes it.
So an end-to-end test (an integration test in this case) should be the right approach.
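
The shape of such a reproducing check might look like the sketch below. It is only an illustration: `disksToRegister` and its `includeRemote` switch are hypothetical stand-ins for the worker's behaviour without and with the patch, and a real integration test would exercise an actual Worker and Master rather than plain lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegistrationCheckSketch {
    // Hypothetical stand-in for the worker assembling its registration payload.
    // includeRemote = false models the behaviour before the patch.
    static List<String> disksToRegister(
            List<String> local, List<String> remote, boolean includeRemote) {
        List<String> out = new ArrayList<>(local);
        if (includeRemote) {
            out.addAll(remote);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> local = Arrays.asList("/mnt/disk1");
        List<String> remote = Arrays.asList("HDFS", "S3");

        // The property under test: every remote disk appears in the payload.
        boolean before = disksToRegister(local, remote, false).containsAll(remote);
        boolean after = disksToRegister(local, remote, true).containsAll(remote);

        System.out.println("unpatched payload reports remote disks: " + before);
        System.out.println("patched payload reports remote disks:   " + after);
    }
}
```

The point is that the same assertion is evaluated against both behaviours, so it fails without the patch and passes with it.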

@Dzeri96
Author

Dzeri96 commented Feb 6, 2026

The problem is that I don't have the right test infrastructure to test this. That's why I think it's time to open a PR against the main repo and ask the maintainers there.
